2024-04-16
Cross-validation is a cornerstone methodology in the field of data science, essential for assessing the predictive performance of statistical models and ensuring their generalizability to unseen data. This resampling technique allows researchers to evaluate how models will perform in practice, addressing critical challenges like overfitting and underfitting, thereby ensuring robustness and reliability of model predictions across various data subsets.
This is one of the simpler methods of cross-validation and tends to be less time-consuming to conduct (Yadav and Shukla 2016). It divides the data into a training set and a test set, usually with a 70-30 or 90-10 split.
This method uses random sampling without replacement from certain probability distributions when splitting the data sets (James 1980). It aims to decrease overfitting by allowing the algorithm to better explore the experimental space.
The dataset is split into k equally sized folds. Each fold is used once as a test set while the rest serve as the training set.
This is a special case of k-fold cross-validation where k equals the number of data points (Shao and Er 2016). Each data point is used once as the test set. In a sense, Leave-One-Out Cross-Validation (LOOCV) is equivalent to setting k = n (the number of observations) in k-Fold Cross-Validation.
Zhang and Yang (2015) focus on cross-validation as a model selection method.
Wong and Yeh (2020) examine the reliability of accuracy estimates derived from k-fold cross-validation.
Marcot and Hanea (2021) confirmed that 10 folds is the standard, but that 5 folds can also work in some cases.
Filzmoser et al. (2009) explored repeated double cross-validation (rdCV).
Xu et al. (2018) introduce Representative Splitting Cross-Validation (RSCV).
Rabinowicz and Rosset (2022) propose a bias-correction measure, CVc, to account for the influence of correlation structure on cross-validation effectiveness.
Lei (2020) introduces Cross-Validation with Confidence (CVC) to reduce overfitting in model selection.
Single hold-out cross-validation is one of the simpler versions of cross-validation in use (Yadav and Shukla 2016).
Randomly and evenly split the dataset into k-folds.
Use k-1 folds as the training set to fit the model.
Predict the response variable in the hold-out (kth) fold using the fitted model.
Calculate the prediction error for the response variable in the hold-out fold.
Repeat steps 2-4 k times, using each fold as a hold-out once.
Calculate the overall test error by averaging the k fold-level prediction errors.
Source: (Ranjan 2021)
The MSE calculation using K-folds is as follows (Jung and Hu 2015):
\[ CV(k) = \frac{1}{k} \sum_{i=1}^{k} MSE_i \tag{1} \]
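As an illustration, the steps above and Equation 1 can be sketched in a few lines of base R; the model (lm on the built-in mtcars data), the predictors, and k = 5 are all toy choices, not part of the analysis below.

```r
# Manual k-fold cross-validation sketch (illustrative: built-in mtcars data, k = 5)
set.seed(1)
k <- 5
# Step 1: randomly and evenly assign each row to one of k folds
folds <- sample(rep(1:k, length.out = nrow(mtcars)))
fold_mse <- numeric(k)
for (i in 1:k) {
  train <- mtcars[folds != i, ]             # Step 2: fit on the other k-1 folds
  fit   <- lm(mpg ~ wt + hp, data = train)
  test  <- mtcars[folds == i, ]             # Step 3: predict the hold-out fold
  pred  <- predict(fit, newdata = test)
  fold_mse[i] <- mean((test$mpg - pred)^2)  # Step 4: fold-level prediction error
}
cv_k <- mean(fold_mse)                      # Equation 1: CV(k) = average of the k MSEs
```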
MCCV randomly removes significant chunks of the dataset without replacement to use as its training and validation (\(s_v\)) data sets (Haddad et al. 2013). This sampling is repeated N times. The criterion for this technique, according to Haddad et al. (2013), is as follows:
\[ MCCV_{n_v}(\phi)=\frac{1}{Nn_v}\sum_{j=1}^{N}\sum_{i=1}^{n_v}\bigl(y_{(s_v^{(j)})(i)}-\hat{y}_{\phi(s_v^{(j)})(i)}\bigr)^2 \tag{2} \]
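A minimal base-R sketch of this criterion follows; N, n_v, and the lm model on the built-in mtcars data are illustrative stand-ins, not part of the later analysis.

```r
# Monte Carlo cross-validation sketch (illustrative: N = 50 repeats, n_v = 8 hold-out rows)
set.seed(2)
N <- 50; n_v <- 8
total_sq_err <- 0
for (j in 1:N) {
  s_v  <- sample(nrow(mtcars), n_v)   # random hold-out set, drawn without replacement
  fit  <- lm(mpg ~ wt + hp, data = mtcars[-s_v, ])
  pred <- predict(fit, newdata = mtcars[s_v, ])
  total_sq_err <- total_sq_err + sum((mtcars$mpg[s_v] - pred)^2)
}
mccv <- total_sq_err / (N * n_v)      # the Equation 2 criterion
```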
The MSE is the expectation of the squared norm of the difference between the estimated and the true parameter vectors, and can be calculated as:
\[ MSE = \sum_{i=1}^{p} (\hat{\beta}_i - \beta_i)^2 \tag{3} \]
where \(\hat{\beta}_i\) is the estimated value of the \(i\)-th parameter, \(\beta_i\) is the true value of the \(i\)-th parameter, and \(p\) is the total number of parameters (Jung and Hu 2015).
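A toy numeric check of Equation 3 (the parameter values below are made up):

```r
# Equation 3 on made-up parameter vectors
beta_true <- c(1.5, -0.8, 0.3)               # true parameters
beta_hat  <- c(1.4, -0.7, 0.5)               # estimates
mse_param <- sum((beta_hat - beta_true)^2)   # 0.01 + 0.01 + 0.04 = 0.06
```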
Another performance metric that will be used to determine model performance is the area under the ROC curve (AUC).
Formula 4 demonstrates how AUC uses the positive and negative outcomes to calculate a value that indicates the probability of a positive result ranking higher than a negative result, where \(S_p\) is the sum of the ranks of the positive outcomes, and \(n_p\) and \(n_n\) are the numbers of positive and negative outcomes, respectively (Hossin and Sulaiman 2015).
\[ AUC = \frac{S_p-n_p(n_p+1)/2}{n_pn_n} \tag{4} \]
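Equation 4 can be checked in base R on toy values by comparing the rank form against a direct count of correctly ordered positive-negative pairs (the labels and scores below are illustrative):

```r
# Rank-based AUC sketch on toy data
obs  <- c(0, 0, 1, 1, 0, 1)
pred <- c(0.1, 0.4, 0.35, 0.8, 0.2, 0.7)
n_p <- sum(obs == 1); n_n <- sum(obs == 0)
S_p <- sum(rank(pred)[obs == 1])          # sum of the ranks of the positives
auc_rank <- (S_p - n_p * (n_p + 1) / 2) / (n_p * n_n)
# Cross-check: proportion of positive-negative pairs ranked correctly
auc_pairs <- mean(outer(pred[obs == 1], pred[obs == 0], ">"))
```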
Source: (“How to Explain the ROC Curve and ROC AUC Score?” 2024)
| Variable | Type | Characteristic |
|---|---|---|
| Age | Predictor | Integer |
| Anaemia | Predictor | Binary (1 = anemic, 0 = not) |
| creatine_phosphokinase | Predictor | Integer |
| Diabetes | Predictor | Binary (1 = diabetic, 0 = not) |
| ejection_fraction | Predictor | Integer |
| high_blood_pressure | Predictor | Binary (1 = HBP present, 0 = no HBP) |
| Platelets | Predictor | Continuous |
| serum_creatinine | Predictor | Continuous |
| serum_sodium | Predictor | Integer |
| Sex | Predictor | Binary (1 = M, 0 = F) |
| Smoking | Predictor | Binary (1 = smoker, 0 = not a smoker) |
| Time | Predictor | Integer |
| death_event | Response | Binary (1 = death, 0 = no death) |
age anaemia creatinine_phosphokinase
0 0 0
diabetes ejection_fraction high_blood_pressure
0 0 0
platelets serum_creatinine serum_sodium
0 0 0
sex smoking time
0 0 0
DEATH_EVENT
0
We counted the missing values for each variable and found none, so the analysis can continue without deleting rows or imputing values.
Since the outcome variable (death event) is binary, logistic regression will be used with five different cross-validation techniques: 10-fold, 5-fold, single hold-out, leave-one-out, and Monte Carlo cross-validation. Two models will be fit for each technique: one with every predictor variable included (Model 1), and one using only serum creatinine and ejection fraction (Model 2), as advised by Chicco and Jurman (2020). The goal is to compare how the difference in performance between the two models changes as the cross-validation technique varies. Model performance will be assessed using MSE and AUC.
With any variation of k-fold cross-validation, the dataset is split into k equally sized folds. The function used to perform our cross-validation was cv.glm, which only requires the data, the glm model, and the number of folds. Here, the dataset is split into 10 folds; as previously discussed, Wong and Yeh (2020) found 10-fold cross-validation to be effective in most cases. Our glm model was changed from linear to logistic regression, since our outcome variable is binary.
In model 1, all predictor variables are included. The default metric for cv.glm is MSE, so a cost function was created to calculate the AUC values for the models. This cost function will be used for both k-folds and LOOCV.
library(boot)
library(ROCR)  # provides prediction() and performance()
auc <- function(obs, pred.p){
  # cost function: area under the ROC curve for the held-out predictions
  pred <- prediction(pred.p, obs)
  cost <- unlist(slot(performance(pred, "auc"), "y.values"))
  return(cost)
}
#Model 1
set.seed(43269)
m1 <- glm(DEATH_EVENT ~ age + anaemia + creatinine_phosphokinase + diabetes + ejection_fraction + high_blood_pressure + platelets + serum_creatinine + serum_sodium + sex + smoking + time,
data = data,
family = binomial(link = "logit"))
k10_m1_mse <- cv.glm(data, m1, K=10)$delta
k10_m1_auc <- cv.glm(data, m1, K=10, cost=auc)$delta

In model 2, only ejection fraction and serum creatinine are included, as the researchers that provided the data set concluded that death events could be predicted using those 2 variables alone.
Interestingly enough, the first model, which included all variables, produced a lower MSE and a higher AUC, indicating that by this CV test, Model 1 would be the better choice. Model 2's AUC is still acceptable, but Model 1's is significantly better.
# A tibble: 1 × 4
M1_K10_MSE M2_K10_MSE M1_K10_AUC M2_K10_AUC
<dbl> <dbl> <dbl> <dbl>
1 0.133 0.182 0.868 0.756
Similar to the 10-fold models above, our dataset is split into equal pieces using cv.glm. However, in this case, we are only doing 5 folds compared to our 10-fold analysis. Again, in model 1, all predictor variables are included.
library(boot)
set.seed(143269)
m1 <- glm(DEATH_EVENT ~ age + anaemia + creatinine_phosphokinase + diabetes + ejection_fraction + high_blood_pressure + platelets + serum_creatinine + serum_sodium + sex + smoking + time,
data = data,
family = binomial(link = "logit"))
k5_m1_mse <- cv.glm(data, m1, K=5)$delta
k5_m1_auc <- cv.glm(data, m1, K=5, cost=auc)$delta

In model 2, only ejection fraction and serum creatinine are included.
Model 1 once again performed best based on its MSE and AUC values. Both AUC values are acceptable, and both MSEs are fairly low; however, Model 1's AUC is significantly better.
# A tibble: 1 × 4
M1_K5_MSE M2_K5_MSE M1_K5_AUC M2_K5_AUC
<dbl> <dbl> <dbl> <dbl>
1 0.136 0.180 0.872 0.759
Using the Single Hold-Out Method, the data set is split into a training set and a test/validation set. Since the cv.glm function only applies to k-fold cross-validation, the splitting of the data is done manually; here, we are using a 50-50 split. The calls for creating each model were identical to those in our previous techniques, using glm() to perform logistic regression.
#add obs # for each row
library(dplyr)
data_SHO <- data %>%
mutate(obs = row_number()) %>%
relocate(obs)
#split
library(rsample)
set.seed(726159)
split_SHO <- initial_split(data_SHO, prop = 0.5)
training_SHO <- training(split_SHO)
validation_SHO <- testing(split_SHO)
#create SHO models with training set
m1_SHO <- glm(DEATH_EVENT ~ age + anaemia + creatinine_phosphokinase + diabetes + ejection_fraction + high_blood_pressure + platelets + serum_creatinine + serum_sodium + sex + smoking + time,
data = training_SHO,
family = binomial(link = "logit"))
m2_SHO <- glm(DEATH_EVENT ~ ejection_fraction + serum_creatinine,
data = training_SHO,
family = binomial(link = "logit"))

The models were applied to the validation set to find the predicted outcome values. The mean of the squared errors of the predicted values was found for each model. To calculate the AUC values, we only needed to use the auc() function with the original outcome variable and the predicted outcome variables of our validation set.
#calculate MSE
#note: predict() returns values on the link (log-odds) scale here;
#adding type = "response" would give predicted probabilities instead
validation_SHO <- validation_SHO %>%
  mutate(yhat_m1_SHO = predict(m1_SHO, newdata = validation_SHO),
         yhat_m2_SHO = predict(m2_SHO, newdata = validation_SHO)) %>%
mutate(sqerr_m1_SHO = (yhat_m1_SHO - DEATH_EVENT)^2,
sqerr_m2_SHO = (yhat_m2_SHO - DEATH_EVENT)^2) %>%
relocate(obs, DEATH_EVENT, yhat_m1_SHO, sqerr_m1_SHO, yhat_m2_SHO, sqerr_m2_SHO)
SHO_m1_mse <- mean(validation_SHO$sqerr_m1_SHO, na.rm = TRUE)
SHO_m2_mse <- mean(validation_SHO$sqerr_m2_SHO, na.rm = TRUE)
#Calculate AUC
SHO_m1_auc = auc(validation_SHO$DEATH_EVENT,validation_SHO$yhat_m1_SHO)
SHO_m2_auc = auc(validation_SHO$DEATH_EVENT,validation_SHO$yhat_m2_SHO)

After conducting the analysis, it can be seen that model 2 (m2) provides a significantly lower mean squared error. Both models have very similar AUC values, with the first model producing only a marginally higher value. M2 includes only ejection fraction and serum creatinine as predictor variables, meaning this reduced model is a better fit according to this cross-validation technique. This result aligns with the conclusion of the original article the data was sourced from.
# A tibble: 1 × 4
M1_SHO_MSE M2_SHO_MSE M1_SHO_AUC M2_SHO_AUC
<dbl> <dbl> <dbl> <dbl>
1 7.45 1.69 0.837 0.820
LOOCV follows a similar data-splitting approach to the previous k-fold examples, but here the data is split k ways, where k equals the number of data points; in this case, k = 299. Since LOOCV is a form of k-fold cross-validation, cv.glm can again be used. The "K=" argument isn't necessary when performing LOOCV, as the function defaults to a number of folds equal to the number of observations. Because AUC cannot be computed on a single held-out observation, the cost function below scores classification accuracy at a 0.5 threshold instead. The same function (glm()) was used to fit our logistic regression models as in the other k-fold techniques.
Once again, the processes for both models was identical apart from the number of predictors included in the models.
cost <- function(labels, pred){
  # proportion of correct classifications at a 0.5 threshold
  mean(labels == ifelse(pred > 0.5, 1, 0))
}
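As a quick toy check (the labels and predicted probabilities below are illustrative), this cost function returns the proportion of observations classified correctly at a 0.5 cutoff:

```r
# Accuracy cost at a 0.5 threshold, checked on toy values
cost <- function(labels, pred){
  mean(labels == ifelse(pred > 0.5, 1, 0))
}
# 0.9 and 0.2 are classified correctly; 0.4 and 0.6 are not
cost(c(1, 0, 1, 0), c(0.9, 0.2, 0.4, 0.6))   # 0.5
```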
# Model 1
set.seed(7432)
m1 <- glm(DEATH_EVENT ~ age + anaemia + creatinine_phosphokinase + diabetes + ejection_fraction + high_blood_pressure + platelets + serum_creatinine + serum_sodium + sex + smoking + time,
data = data,
family = binomial(link = "logit"))
loocv_m1_mse <- cv.glm(data, m1)$delta
loocv_m1_auc <- cv.glm(data, m1, cost = cost)$delta
# Model 2
set.seed(20987)
m2 <- glm(DEATH_EVENT ~ ejection_fraction + serum_creatinine,
data = data,
family = binomial(link = "logit"))
loocv_m2_mse <- cv.glm(data, m2)$delta
loocv_m2_auc <- cv.glm(data, m2, cost = cost)$delta

According to LOOCV, the full model (M1) is a better fit due to a lower MSE compared to the reduced model (M2). The AUC values follow a similar pattern, with model 1 scoring higher than model 2.
# A tibble: 1 × 4
`M1 LOOCV MSE` `M2 LOOCV MSE` `M1 LOOCV AUC` `M2 LOOCV AUC`
<dbl> <dbl> <dbl> <dbl>
1 0.134 0.182 0.827 0.742
Unlike the previous techniques, the mc_cv() function (from the rsample package) is used to repeat the splitting of the data many times. Our proportion of training to validation data is the standard 70/30 split, and the data will be resampled 100 times. Workflows were used to specify our model type (logistic regression) and to plug in the variables for each of our models.
library(tidymodels)  # provides mc_cv(), logistic_reg(), and workflow()
data$DEATH_EVENT <- as.factor(data$DEATH_EVENT)
set.seed(30)
resample1 <- mc_cv(data, times = 100, prop = .7)
mccv_model <- logistic_reg() %>% #creating classification model
set_engine("glm") %>%
set_mode("classification")
m1_wflow <- #adding variables for model 1
workflow() %>%
add_formula(
DEATH_EVENT ~ .) %>%
add_model(mccv_model)
m2_wflow <- # adding variables for model 2
workflow() %>%
add_formula(
DEATH_EVENT ~ ejection_fraction + serum_creatinine) %>%
add_model(mccv_model)

The fit_resamples() function applies the resampling object created earlier to our models to perform the cross-validation. For the MSE of the models, we will use the Brier score, since fit_resamples() generates it automatically and it behaves similarly to MSE in classification problems (Ferro and Fricker 2012).
keep_pred <- control_resamples(save_pred = TRUE, save_workflow = TRUE)
set.seed(30)
m1_MCCV <-m1_wflow %>% fit_resamples(resamples = resample1, control = keep_pred)
m1_MCCV_metrics <- collect_metrics(m1_MCCV)
mccv_m1_mse <- as.numeric(m1_MCCV_metrics[2,3])
mccv_m1_auc <- as.numeric(m1_MCCV_metrics[3,3])
m2_MCCV <- m2_wflow %>% fit_resamples(resamples = resample1, control = keep_pred)
m2_MCCV_metrics <-collect_metrics(m2_MCCV)
mccv_m2_mse <- as.numeric(m2_MCCV_metrics[2,3])
mccv_m2_auc <- as.numeric(m2_MCCV_metrics[3,3])

Similar to the previously discussed techniques, our full model had the lower MSE/Brier score and the higher AUC. The smaller model's AUC of 0.763 is still acceptable as far as the metric's ideal scores go, but the increase of about 0.1 suggests there may be another predictor or two that could contribute significantly to the model.
# A tibble: 1 × 4
M1_MCCV_MSE M2_MCCV_MSE M1_MCCV_AUC M2_MCCV_AUC
<dbl> <dbl> <dbl> <dbl>
1 0.139 0.183 0.864 0.763
CV M1_MSE M1_AUC M2_MSE M2_AUC
1 10-Fold 0.1327269 0.8675932 0.1816103 0.7559824
2 5-Fold 0.1355044 0.8719124 0.1798165 0.7585011
3 Single Hold-Out 7.4504247 0.8374603 1.6855614 0.8203175
4 LOOCV 0.1342105 0.8271048 0.1815950 0.7423519
5 MCCV 0.1393454 0.8644844 0.1831616 0.7633794
The only technique with clearly different results was single hold-out cross-validation: it alone indicated the reduced model as the superior one. This pronounced difference leads us to believe the simplicity of the technique could have introduced some overfitting and higher bias; other training/testing split ratios could be considered to combat this. The remaining techniques performed so similarly across all measures that choosing the technique that best fit and predicted our data is difficult.